The data analyzed here are a set of 284,807 credit card transactions, of which 492 are fraudulent. Each point has 30 features and a label. 28 of the features are the result of a PCA transformation and scaling, which effectively anonymizes the transactions. If the PCA was fit on the full dataset rather than on a training split alone, it may have introduced data leakage, but that cannot be determined without more information on how the transformation was performed. Here is where I sourced the data from.
The data are highly imbalanced (fraudulent transactions make up roughly 0.17% of the total), so we will need to either find a good way to artificially balance them or use a method that is resistant to imbalance. We will be doing the former, for the sake of practice. Our evaluation metrics will have to be chosen with care as well: plain accuracy is nearly meaningless when over 99.8% of the labels belong to one class.
The packages required for this notebook are all available in the standard Anaconda 3 installation, with one exception: if you wish to run this yourself, you will need to install the imbalanced-learn package.
# Import necessary libraries.
#----------------------------------------------------------#
# General data storage and manipulation
import numpy as np
import pandas as pd
from os import listdir
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Preprocessing
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE
# Model selection
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
# The models
from sklearn.linear_model import SGDClassifier
from imblearn.pipeline import make_pipeline, Pipeline
# Model evaluation
from sklearn.metrics import (roc_auc_score, make_scorer, roc_curve, precision_recall_curve,
auc, average_precision_score)
from time import process_time
# Grab the data.
#----------------------------------------------------------#
filepaths = ["Data/" + f for f in listdir("./Data") if f.endswith('.csv')]
data = pd.concat(map(pd.read_csv, filepaths), ignore_index=True)  # avoid duplicate row labels across files
There are a few minor issues with the data that we must address before moving on. The first is that two of the features - time and amount - have not yet been scaled. This could impact the performance of our model if not corrected. The second is that the feature matrix and label vector are combined. This is easily resolved.
# Scale the yet unscaled features.
#----------------------------------------------------------#
rob_scaler = RobustScaler()
# Each fit_transform refits the scaler, so Time and Amount are scaled
# independently; ravel() flattens the (n, 1) output back to 1-D for insertion.
time_scaled = rob_scaler.fit_transform(data['Time'].values.reshape(-1, 1)).ravel()
amount_scaled = rob_scaler.fit_transform(data['Amount'].values.reshape(-1, 1)).ravel()
data.drop(['Time', 'Amount'], axis = 1, inplace = True)
data.insert(0, 'time_scaled', time_scaled)
data.insert(1, 'amount_scaled', amount_scaled)
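To see why RobustScaler was chosen here rather than standardization, here is a small sketch on synthetic data with an outlier (transaction amounts are similarly heavy-tailed): it centers on the median and scales by the interquartile range, so a single extreme value cannot dominate the scale.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic column with one large outlier, mimicking a heavy-tailed feature.
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)
x_scaled = RobustScaler().fit_transform(x)

# The median (3.0) maps to 0, and the IQR (4 - 2 = 2) becomes the unit scale,
# so the bulk of the data lands in a narrow range regardless of the outlier.
print(x_scaled.ravel())  # [-1.0, -0.5, 0.0, 0.5, 498.5]
```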
# Create the feature matrix and label vector.
#----------------------------------------------------------#
X = data.drop('Class', axis = 1)
y = data['Class']
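The StratifiedShuffleSplit imported above matters for data this imbalanced: a plain random split could easily leave the test set with almost no fraud cases. A minimal sketch on synthetic labels (not the real data) shows how stratification preserves the class ratio in each fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in: 990 negatives and 10 positives.
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.zeros((1000, 1))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X_demo, y_demo))

# Exactly 20% of each class lands in the test set.
print(y_demo[test_idx].sum())   # 2 of the 10 positives
print(y_demo[train_idx].sum())  # 8 of the 10 positives
```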
As always, it serves us well to take a look at the data before attempting to fit a model to it. To do so legibly we will first randomly undersample both classes, taking 400 points from each so the plot is not dominated by the majority class. We will then produce a corner plot of the data and calculate the correlation between each feature pair.
# Generate the corner plot.
#----------------------------------------------------------#
plot_mat = pd.concat([data[data['Class'] == 0].sample(400),
                      data[data['Class'] == 1].sample(400)])
g = sns.PairGrid(plot_mat, hue = "Class", vars = X.columns)
g = g.map_diag(sns.kdeplot, lw = 2, shade = True)
g = g.map_lower(sns.scatterplot, alpha = 0.6)
g = g.map_upper(sns.scatterplot, alpha = 0.6)
g = g.add_legend(title = "Class")
plt.savefig("Figures/Corner.png", bbox_inches = "tight");
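The pairwise correlations mentioned above are compactly shown as a heatmap. Here is a sketch on a small synthetic frame; with the real data, `df` would be replaced by `plot_mat` (or the full `data`).

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["V1", "V2", "V3", "V4"])
df["V2"] += df["V1"]  # induce one correlation so the plot is not blank

corr = df.corr()  # Pearson correlation for every feature pair
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, annot=True, fmt=".2f")
plt.title("Pairwise feature correlations")
```

For PCA-derived features the off-diagonal entries should be near zero by construction, so any strong correlation involving `time_scaled` or `amount_scaled` is the interesting signal here.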